Spatial transcriptomics data analysis using R

Phillip Nicol

Introduction

Outline

  1. Review of Spatial transcriptomics technologies

  2. Spatial transcriptomics data processing and storage in R

  3. Finding spatially variable genes

  4. Dimension reduction and clustering

How to follow along

Downloading the GitHub repository from the command line:

git clone https://github.com/phillipnicol/UPR_ST_workshop.git

Or navigate to the GitHub repository and download the ZIP.

Installing and loading required packages

Running the following script will install and load all of the necessary packages.

source("install_packages.R")

Review of Spatial transcriptomics (ST) technology

Many of the most popular technologies for ST can be divided into two groups (Righelli et al. 2022):

  • Spot-based: Next generation sequencing (NGS) with spatial barcoding

  • Molecule-based: In-situ imaging of individual molecules

Spot-based: 10X Genomics Visium

10X Genomics Visium Youtube Video

Molecule-based: Small molecule fluorescence in situ hybridization (smFISH)

  • Colored probes attach to mRNA transcript from a target gene.

  • Quantify expression by imaging

  • Key challenge is extending to many genes

Figure 4A of Ji and Van Oudenaarden (2012)

Molecule-based: Multiplexed error robust FISH (MERFISH)

MERFISH uses (error robust) combinatorial labeling to increase the number of gene transcripts that can be measured.

Basic idea: assign a \(N\)-bit binary string to each gene. Then \(2^N\) genes can be encoded and measured after \(N\) sequential rounds of smFISH.

Fig 1A of Chen et al. (2015)

Comparison

Visium (Spot-based) Molecule-based
Spatial Resolution 1-10 cells per spot sub-cellular
Number of genes Whole transcriptome (20,000+) 100-10,000 (MERFISH)
Detection efficiency ~15% ~95%
Accessibility Commercially available; raw data FASTQ Raw images can be large and difficult to analyze

Most ST studies choose Visium

Figure from (Moses and Pachter 2022)

Most ST studies choose R

Figure from (Moses and Pachter 2022)

Spatial Transcriptomics Data Processing

Upstream data processing

Figure from (Moses and Pachter 2022)

Visium data format

Once processed, ST data can be represented using two matrices.

  • Count matrix \(Y\): For each of \(J\) spots record the number of reads from each of \(I\) genes.

  • Coordinate matrix \(X\): For each of \(J\) spots record the 2-dimensional spatial coordinate.

Storing ST data in R

The SpatialExperiment package provides a container to store the count and coordinate matrix. Righelli et al. (2022)

Creating a SpatialExperiment object

10X genomics Visium data

The 10X genomics website has example ST datasets.

For this workshop, we will download samples from human cerebral cortex:

Downloading from the website

The data can be downloaded by navigating to the website or directly from the terminal (Unix):

mkdir cortex
cd cortex 

curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz

curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_spatial.tar.gz

tar xf V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz

tar xf V1_Human_Brain_Section_1_spatial.tar.gz

Downloading data

If downloading the data from the terminal does not work on your computer, the folder can be downloaded from Google Drive. This link will be deleted within a week.

Exercise: Creating the object

dir <- "./cortex"

### Step 1: Load the count matrix
require(DropletUtils)
fnm <- file.path(dir, "raw_feature_bc_matrix")
sce <- read10xCounts(fnm) #Single cell object

### Step 2: Load the tissue image
img <- readImgData(
    path = file.path(dir, "spatial"),
    sample_id = "brain")

# Step 3: Read spatial coordinates
fnm <- file.path(dir, "spatial", "tissue_positions_list.csv")
xy <- read.csv(fnm, header = FALSE,
    col.names = c(
        "barcode", "in_tissue", "array_row", "array_col",
        "pxl_row_in_fullres", "pxl_col_in_fullres"))
ix <- match(sce$Barcode, xy$barcode) #Get cells in correct order
xy <- xy[ix,]

### Step 4: Construct feature metadata 
rd <- DataFrame(symbol = rowData(sce)$Symbol)

### Step 5: Create the object
spe <- SpatialExperiment(
    assays = list(counts = assay(sce)),
    rowData = rd, 
    colData = DataFrame(xy), 
    spatialCoordsNames = c("pxl_col_in_fullres", "pxl_row_in_fullres"),
    imgData = img,
    sample_id = "brain")

### Step 6: Save the object 
saveRDS(spe, file=file.path(dir,"spe.RDS"))

Source: SpatialExperiment Vignette

Plotting an image of the tissue

spi <- getImg(spe)
plot(as.raster(spi))

Removing spots not in the tissue

df <- data.frame(x=spatialCoords(spe)[,1], y=spatialCoords(spe)[,2],
                 in_tissue=as.character(spe$in_tissue))
p <- ggplot(df,aes(x=x,y=y,color=in_tissue))+
  geom_point()+
  scale_color_manual(values=c("grey", "red"))+
  theme_bw()
p

spe <- spe[,spe$in_tissue == 1] #This is how you subset by spots/samples

More example ST data

For additional examples of data in the SpatialExperiment format, a good package is STexampleData :

### Example dataset 
spe2 <- STexampleData::ST_mouseOB()

Finding Spatially Variable Genes

What is a spatially variable gene (SVG)?

A spatially variable gene is one whose expression depends on spatial location

Figure from Zhu, Sun, and Zhou (2021)

Exercise: Plotting spatially variable genes

In the human cerebral cortex, the genes MOBP, NPY, SNAP25, and PCP4 were previously reported by Weber et al. (2023) to be spatially variable. We will verify this by plotting.

Problem: Instead of gene symbols we are given ENSEMBL ID:

head(rownames(spe))
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000239906"

Exercise: Plotting SVGs

### Convert ENSEMBL ID to gene symbol
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- rownames(spe)
G_list <- getBM(filters="ensembl_gene_id",
                attributes= c("ensembl_gene_id","hgnc_symbol"),
                values=genes,
                mart=mart)
no.match <- which(G_list$hgnc_symbol == "")
G_list[no.match,2] <- G_list[no.match,1]
ix <- match(G_list$ensembl_gene_id, rownames(spe))
rownames(spe)[ix] <- G_list$hgnc_symbol
interesting_genes <- which(rownames(spe) %in% c("MOBP","PCP4", "SNAP25","NPY"))
### Normalize count matrix and subset to interesting genes
spe <- logNormCounts(spe)

Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])

## Switch rows and columns and make dataframe 
df <- as.data.frame(t(Y.sub))

## Add spatial coordinates 
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]
### https://github.com/lmweber/nnSVG-analyses
### Author: Lukas Weber
df <- melt(df, id.vars=c("x","y"))

p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
  geom_point(size=0.05) + 
  coord_fixed() +
  scale_color_gradientn(colors=c("gray90", "blue", "black"),
                       breaks=c(0,3,6)) + 
  facet_wrap(~variable, nrow=2) + 
  theme_bw() + 
  guides(color=guide_colorbar(ticks=FALSE)) + 
  labs(color="log count") + 
  theme(strip.text = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
p

Finding SVGs automatically

  • A statistical method can automatically identify spatially variable genes without relying on previous knowledge.

  • There are several existing methods, but it is still an area of active research.

SPARK-X

Fast non-parametric test to see if the distribution of gene expression \(Y_{i \cdot}\) depends on the spatial locations \(X\). More specifically, does the difference in gene expression between cell \(i\) and \(j\) depend on the spatial distance between cell \(i\) and \(j\) ? Zhu, Sun, and Zhou (2021)

library(SPARK)

Running SPARK-X

## Remove mitochondrial genes
mito.genes <- which(grepl("^MT-",rownames(spe)))
spe <- spe[-mito.genes,]

res <- sparkx(
    count_in = counts(spe)[rowSums(counts(spe)) > 10,], 
    locus_in = spatialCoords(spe), 
    verbose = FALSE)

Genes with the smallest p-values

interesting_genes <- rownames(res$res_mtest)[order(res$res_mtest$combinedPval)[1:10]]

spe <- logNormCounts(spe)

Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])

## Switch rows and columns and make dataframe 
df <- as.data.frame(t(Y.sub))

## Add spatial coordinates 
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]

### Construct plot
df <- melt(df, id.vars=c("x","y"))

p <- ggplot(data=df,aes(x=x,y=y,color=value)) +
  geom_point(size=0.05) + 
  coord_fixed() +
  scale_color_gradientn(colors=c("gray90", "blue", "black"),
                       breaks=c(0,4,8)) + 
  facet_wrap(~variable, nrow=2) + 
  theme_bw() + 
  guides(color=guide_colorbar(ticks=FALSE)) + 
  labs(color="log count") + 
  theme(strip.text = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
p

Plotting genes with the largest p-values (least spatially variable)

Dimension Reduction and Clustering

Dimension Reduction

Our dataset has over \(30,000\) genes. However, many appear to be highly correlated (see the top SVG plot we made earlier).

If we account for the strong correlation, we may be able to summarize the data using a much smaller number of genes, thus reducing the “dimension”.

Principal Component Analysis (PCA)

PCA begins by finding the line that best approximates the data (PC dimension 1). Next, it finds a line perpendicular to PC dimension 1 that best approximates the data (PC dimension 2). Then repeat to find PC dimension 3, 4, etc.

Image source

Principal Component Analysis (PCA)

To reduce the data to dimension \(k\), we can project each data point onto the linear space spanned by the first \(k\) PC dimensions.

For example, when \(k = 1\) we project each point onto a line. When \(k = 2\), we project each point onto a plane.

Simulation: Generating data

set.seed(323)
data <- data.frame(mvrnorm(n=100, mu=c(0,0), Sigma=matrix(c(1,0.95,0.95,1),nrow=2)))
data$color <- data[,1]
p <- ggplot(data,aes(x=X1,y=X2,color=color)) +
  geom_point(size=2) + 
  scale_color_gradient(low="lightblue", high="firebrick") + 
  theme_bw() + 
  guides(color="none") + 
  xlab("x") + ylab("y")
p

Simulation: Projecting onto PC 1

pca <- prcomp(data[,c("X1", "X2")])

data$pc1 <- pca$x[,1]
p <- ggplot(data=data,aes(x=pc1,y=0,color=color)) + 
  geom_point(size=2) + 
  scale_color_gradient(low="lightblue", high="firebrick") + 
  theme_bw() + 
  theme(axis.title.y=element_blank(),
           axis.text.y=element_blank(), 
           axis.ticks.y=element_blank()) + 
  guides(color="none") + 
  xlab("PC1 Score")
p

Applying PCA to ST data

For computational speed, it is standard practice to subset to the ~2000 most highly variable genes before running PCA:

top.hvgs <- getTopHVGs(spe, n=2000) #Get the top 2000 genes
spe <- fixedPCA(spe, subset.row=top.hvgs) #Running PCA 

scran Vignette

Finding the genes in PC dimension 1

What do you notice about the genes that contribute to PC dimension 1?

pc_dims <- attr(reducedDim(spe), "rotation")

pc_dim1 <- pc_dims[,1]

#Top 5 genes with positive weight
names(pc_dim1)[order(pc_dim1, decreasing=TRUE)[1:5]]
[1] "MOBP"  "MBP"   "PLP1"  "GFAP"  "BCAS1"
#Top 5 genes with negative weight
names(pc_dim1)[order(pc_dim1)[1:5]]
[1] "NEFL"   "SNAP25" "NEFM"   "SYT1"   "VSNL1" 

Projecting onto the first 3 PC dimensions

df <- data.frame(x=spatialCoords(spe)[,1],
                 y=spatialCoords(spe)[,2],
                 PC1=reducedDim(spe)[,1],
                 PC2=reducedDim(spe)[,2],
                 PC3=reducedDim(spe)[,3])

df <- df |> pivot_longer(cols=c(PC1,PC2,PC3))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) + 
  geom_point(size=0.5) + 
  coord_fixed()+
  scale_color_gradient2(low="blue",mid="grey", high="red") +
  theme_bw()+
  facet_wrap(~name,nrow=1)
p

Clustering

The PCA results show that there are groups of spots that have very similar expression levels. By applying a clustering algorithm we can explicitly define these groups of cells. Many approaches for clustering first build the shared nearest neighbor (SNN) graph of the cells.

By j_ham3 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17125894

Building the SNN graph and applying a community detection algorithm

set.seed(234)
g <- buildSNNGraph(spe, use.dimred="PCA")
cluster <- igraph::cluster_walktrap(g)$membership

Plotting the clusters

df <- data.frame(x=spatialCoords(spe)[,1],
                 y=spatialCoords(spe)[,2],
                 cluster=as.character(cluster))
p <- ggplot(data=df,aes(x=x,y=y,color=cluster)) + 
  geom_point() + 
  theme_bw()
p

Finding marker genes

A marker gene is one that is differentially expressed in a particular cluster. Marker genes are used to better understand the biological function of the identified clusters.

colLabels(spe) <- factor(cluster)
markers <- scoreMarkers(spe)

Printing the top 5 markers in each cluster

top5 <- matrix(NA, nrow=length(markers), ncol=5)
rownames(top5) <- paste("Cluster", 1:length(markers))
colnames(top5) <- paste("Marker", 1:5)
for(i in 1:length(markers)) {
  top5[i,] <- rownames(markers[[i]])[order(markers[[i]]$mean.AUC, decreasing=TRUE)[1:5]]
}
top5 <- as.data.frame(top5)
knitr::kable(top5)
Marker 1 Marker 2 Marker 3 Marker 4 Marker 5
Cluster 1 ENSG00000269028 SPARC AQP4 SPARCL1 FOS
Cluster 2 MBP PLP1 MOBP BCAS1 CNP
Cluster 3 SNAP25 SYT1 NEFL TUBA1B STMN2
Cluster 4 SLC1A2 MT3 SLC1A3 ATP1A2 ENSG00000269028
Cluster 5 ENC1 VSNL1 NRGN YWHAH MEF2C
Cluster 6 MOBP MBP PLP1 BCAS1 ERMN
Cluster 7 MAP1A MAP2 SLC17A7 PRKCB NPTX1

Top marker genes

Per (Maynard et al. 2021),

  • MOBP is a white matter/oligodendrocyte marker (Cluster 2)

  • SNAP25 is a gray matter/neuron marker (Cluster 3)

Additional exercises

Repeat the analysis of finding SVGs and clusters for the data (of your choice) in the STexampleData package

## Possible options 
spe <- STexampleData::seqFISH_mouseEmbryo()
spe <- STexampleData::SlideSeqV2_mouseHPC()
spe <- STexampleData::ST_mouseOB()
spe <- STexampleData::Visium_humanDLPFC()
spe <- STexampleData::Visium_mouseCoronal()

References

Chen, Kok Hao, Alistair N. Boettiger, Jeffrey R. Moffitt, Siyuan Wang, and Xiaowei Zhuang. 2015. “Spatially Resolved, Highly Multiplexed RNA Profiling in Single Cells.” Science 348 (6233): aaa6090. https://doi.org/10.1126/science.aaa6090.
Ji, Ni, and Alexander Van Oudenaarden. 2012. “Single Molecule Fluorescent in Situ Hybridization (smFISH) of C. Elegans Worms and Embryos.” WormBook, December, 1–16. https://doi.org/10.1895/wormbook.1.153.1.
Maynard, Kristen R., Leonardo Collado-Torres, Lukas M. Weber, Cedric Uytingco, Brianna K. Barry, Stephen R. Williams, Joseph L. Catallini, et al. 2021. “Transcriptome-Scale Spatial Gene Expression in the Human Dorsolateral Prefrontal Cortex.” Nature Neuroscience 24 (3): 425–36. https://doi.org/10.1038/s41593-020-00787-0.
Moses, Lambda, and Lior Pachter. 2022. “Museum of Spatial Transcriptomics.” Nature Methods 19 (5): 534–46. https://doi.org/10.1038/s41592-022-01409-2.
Righelli, Dario, Lukas M Weber, Helena L Crowell, Brenda Pardo, Leonardo Collado-Torres, Shila Ghazanfar, Aaron T L Lun, Stephanie C Hicks, and Davide Risso. 2022. “SpatialExperiment: Infrastructure for Spatially-Resolved Transcriptomics Data in R Using Bioconductor.” Edited by Valentina Boeva. Bioinformatics 38 (11): 3128–31. https://doi.org/10.1093/bioinformatics/btac299.
Weber, Lukas M., Arkajyoti Saha, Abhirup Datta, Kasper D. Hansen, and Stephanie C. Hicks. 2023. “nnSVG for the Scalable Identification of Spatially Variable Genes Using Nearest-Neighbor Gaussian Processes.” Nature Communications 14 (1): 4059. https://doi.org/10.1038/s41467-023-39748-z.
Zhu, Jiaqiang, Shiquan Sun, and Xiang Zhou. 2021. “SPARK-X: Non-Parametric Modeling Enables Scalable and Robust Detection of Spatial Expression Patterns for Large Spatial Transcriptomic Studies.” Genome Biology 22 (1): 184. https://doi.org/10.1186/s13059-021-02404-0.